When Frequency Data Meet Dispersion Data in the Extraction of Multi-word Units from a Corpus: A Study of Trigrams in Chinese
Abstract
One of the main approaches to extracting multi-word units is the frequency threshold approach, but its treatment of dispersion data still leaves much to be desired. This study adopts Gries’s (2008) dispersion measure to extract trigrams from a Chinese corpus and compares the results with those of the frequency threshold approach. The overlap between the two sets of results is not very large, which demonstrates the need to take dispersion data more seriously and points to the dynamic nature of lexical representations. Moreover, the trigrams extracted in the present study can be used in a wide range of Chinese language resources.
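The comparison described in the abstract can be illustrated with a minimal sketch. It assumes that the dispersion measure in question is Gries's DP (deviation of proportions), computed as half the sum of absolute differences between each corpus part's share of a trigram's occurrences and that part's share of the corpus. The function names, the thresholds `min_freq` and `max_dp`, and the choice to measure part size in trigram positions over pre-segmented tokens are all assumptions made for illustration; they are not taken from the study.

```python
from collections import Counter


def trigrams(tokens):
    """Return an iterator over consecutive 3-token tuples."""
    return zip(tokens, tokens[1:], tokens[2:])


def trigram_dispersion(parts):
    """Map each trigram to (frequency, DP) for a corpus split into parts.

    parts: list of token lists, one per corpus part (already segmented).
    DP is 0 for a perfectly even spread and approaches 1 when a trigram
    is concentrated in a small part of the corpus.
    """
    per_part = [Counter(trigrams(p)) for p in parts]
    # Assumption: part size is counted in trigram positions.
    sizes = [sum(c.values()) for c in per_part]
    total = sum(sizes)
    if total == 0:
        return {}
    expected = [s / total for s in sizes]

    totals = Counter()
    for c in per_part:
        totals.update(c)

    result = {}
    for gram, freq in totals.items():
        # Observed share of this trigram in each part vs. expected share.
        dp = 0.5 * sum(abs(c[gram] / freq - e)
                       for c, e in zip(per_part, expected))
        result[gram] = (freq, dp)
    return result


def compare_selections(stats, min_freq, max_dp):
    """Return (by_frequency, by_dispersion, overlap) as sets of trigrams.

    min_freq and max_dp are hypothetical cut-offs chosen by the caller.
    """
    by_freq = {g for g, (f, _) in stats.items() if f >= min_freq}
    by_dp = {g for g, (_, dp) in stats.items() if dp <= max_dp}
    return by_freq, by_dp, by_freq & by_dp


# Toy corpus of three parts; real input would be segmented Chinese text.
toy_parts = [
    ["a", "b", "c", "a", "b", "c"],
    ["a", "b", "c", "d"],
    ["x", "y", "z", "w"],
]
toy_stats = trigram_dispersion(toy_parts)
toy_freq, toy_dp, toy_overlap = compare_selections(toy_stats, 2, 0.3)
```

A trigram can pass the frequency cut-off while failing the dispersion cut-off (and vice versa), which is the kind of partial overlap the abstract reports.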
Similar resources
Lexical Bundles in English Abstracts of Research Articles Written by Iranian Scholars: Examples from Humanities
This paper investigates a special type of recurrent expressions, lexical bundles, defined as a sequence of three or more words that co-occur frequently in a particular register (Biber et al., 1999). Considering the importance of this group of multi-word sequences in academic prose, this study explores the forms and syntactic structures of three- and four-word bundles in English abstracts writte...
Developing a Corpus-Based Word List in Pharmacy Research Articles: A Focus on Academic Culture
The present corpus-based lexical study reports the development of a Pharmacy Academic Word List (PAWL), a list of the most frequent words from a corpus of 3,458,445 tokens made up of the 800 most recent pharmacy texts, including research articles, review articles, and short communications in four sub-disciplines of pharmacy. WordSmith (Scott, 2017) and AntWordProfiler (Anthony, 2014) were used to sc...
Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations
The main task of tokenization is to divide the sentences of a text into their constituent units and remove punctuation marks (periods, commas, etc.). Each unit is a continuous lexical or grammatical written string that forms an independent semantic unit. Tokenization occurs at the word level, and the extracted units can be used as input to other components such as a stemmer. The requirement to create...
A Corpus-driven Food Science and Technology Academic Word List
The overarching goal of this study was to create a list of the most frequently occurring academic words in Food Science and Technology (FST). To this end, a 4,652,444-word corpus called Food Science and Technology Research Articles (FSTRA), which included 1,421 research articles (RAs) randomly selected from 38 journals across five sub-disciplines in FST, was developed. Frequency and range-based...
Increasing the discrimination power the decision making units based on reducing dispersion of weights in the data envelopment analysis
Data envelopment analysis, which is a nonparametric technique for evaluating the relative efficiency of decision making units with multiple inputs and outputs, has been a very popular method among researchers. While this nonparametric technique is popular, it has some drawbacks, such as a lack of discrimination among efficient units and weight dispersion. The present study, which is a model based on ...